Problem Set 1 - Solutions

library(tidyverse)
my_theme <- theme_bw() +
  theme(
    panel.background = element_rect(fill = "#f7f7f7"),
    panel.grid.minor = element_blank(),
    axis.ticks = element_blank(),
    plot.background = element_rect(fill = "transparent", colour = NA)
  )
theme_set(my_theme)

London Olympics

Scoring

  • a - b, Design (1 points): Creative and readable (1 point), generally appropriate but with some lack of critical attention (.5 points), difficult to read (0 points)
  • a - b, Code (0.5 points): Clear and concise (0.5 points), correct but unnecessarily complex (0.25 points), missing (0 points)
  • c, Design and Discussion (1 points): Creative question, solution, and interpretation (1 point), appropriate question, solution, and interpretation, but perhaps simplistic question / difficult to read design / underdeveloped interpretation (0.5 points), misleading design or no interpretation (0 points)
  • c, Code (0.5 points): Clear and concise (0.5 points), correct but unnecessarily complex (0.25 points), missing (0 points)

Question

The data at this link describes all participants in the London 2012 Olympics.

Example Solution

  1. Create a layered display that shows (i) the ages of athletes across sports and (ii) the average age within each sport.

    olympics <- read_csv("https://uwmadison.box.com/shared/static/rzw8h2x6dp5693gdbpgxaf2koqijo12l.csv")
    averages <- olympics %>%
      group_by(Sport) %>%
      summarise(Age = mean(Age))
    
    ggplot(olympics, aes(Age, Sport)) +
      geom_point(position = position_jitter(h = .2), size = 0.5, col = "#63CAF2") +
      geom_point(data = averages, col = "#184059")

  2. Sort the sports from lowest to highest average age.

    ggplot(olympics, aes(Age, reorder(Sport, Age))) +
      geom_point(position = position_jitter(h = .2), size = 0.5, col = "#63CAF2") +
      geom_point(data = averages, col = "#184059") +
      labs(x = "Age", y = "Sport")

  3. Develop one new question based on these data. What makes you interested in it? Provide a visualization that supports the comparisons needed to arrive at an answer.

There are many possible solutions to this problem. Some potential questions of interest include,

  • Which countries win the most medals overall?
  • Which countries win the most medals in which sports?
  • Which athletes won the most medals?
  • How does athlete age vary across both sport and gender?
  • How many athletes were born outside of the country that they competed for?

Here is an example design for the first question. Some of the interesting findings include: (i) Some countries have much larger (or lower) proportions of Gold medals, in spite of lower overall medal count (e.g., Germany and Canada), (ii) there is a long tail of countries with 1 - 2 medals. We could imagine faceting by a few of the major sports (using fct_lump to group the rare ones), but we would want to reorder separately within each facet (this will be covered in the module on text data).

olympics %>%
  group_by(Country) %>%
  summarise(across(Gold:Bronze, sum)) %>%
  pivot_longer(-Country, names_to = "Medal") %>%
  filter(value > 0) %>%
  mutate(Medal = factor(Medal, levels = c("Bronze", "Silver", "Gold"))) %>%
  ggplot() +
  geom_col(
    aes(value, reorder(Country, value, sum), fill = Medal),
    width = 1
  ) +
  scale_fill_manual(values = c("#cd7f32", "#c0c0c0", "#ffd700")) +
  scale_x_continuous(expand = c(0, 0, 0.1, 0)) +
  labs(
    x = "Medal Count",
    y = "Country"
  )

Pokemon

Scoring

  • a - c, Design (1 point): Creative and readable (1 point), generally appropriate but with some lack of critical attention (.5 points), difficult to read (0 points)
  • a - c, Code (1 point): Clear and concise (1 point), correct but unnecessarily complex (0.5 points), missing (0 points)
  • d, Discussion (1 points): Creative and well-developed discussion which references course concepts (1 point), appropriate discussion but potentially underdeveloped (0.5 points), vague or unclear proposal (0 points).

Question

This problem gives practice in deriving new variables to improve a faceted plot. The data below give attack and defense statistics for Pokemon, along with their types. We will build a visualization to answer the question – how do the different types of Pokemon vary in their attack and defense potential?

Example Solution

  1. Derive a new column containing the attack-to-defense ratio, defined as \(\frac{\text{Attack}}{\text{Defense}}\).

    pokemon <- read_csv("https://uwmadison.box.com/shared/static/hf5cmx3ew3ch0v6t0c2x56838er1lt2c.csv") %>%
      mutate(attack_to_defense = Attack/Defense)
  2. For each type_1 group of Pokemon, compute the median attack-to-defense ratio.

    group_ratio <- pokemon %>%
      group_by(type_1) %>%
      summarise(group_ratio = median(attack_to_defense)) %>%
      arrange(-group_ratio)
  3. Plot the attack vs. defense scores for each Pokemon, faceted by type_1. Use the result of (b) to ensure that the panels are sorted from types with highest to lowest attack-to-defense ratio.

    pokemon %>%
      mutate(type_1 = factor(type_1, levels = pull(group_ratio, type_1))) %>%
      ggplot(aes(Defense, Attack)) +
      geom_abline(slope = 1, col = "#BFBFBF") +
      geom_point(size = 0.8) +
      facet_wrap(~ type_1)

  4. Propose, but do not implement, a visualization of this dataset that makes use of dynamic queries. What questions would the visualization answer? What would be the structure of interaction, and how would the display update when the user provides a cue?

A variety of answers could be provided for this problem. Some potential query / inputs that could be supported include,

  • Allow users to select the type_1 group of pokemon from a dropdown menu, so that the data do not need to be faceted.
  • Create a histogram of attack-to-defense ratio, allowing users to graphically query samples with especially low or high ratios
  • Create scatterplots of other features, like speed or HP, and link graphical queries on this scatterplot with the original plot above (or with a table).

Soccer Code Review

Scoring

  • Discussion [1.5 points]: Specific, accurate, and well-written feedback to colleague (1.5 points), accurate but incomplete or imprecise feedback to colleague (0.75 points), highly underdeveloped feedback (0 points)
  • Alternative Design [1 point]: Design appropriately addresses earlier critique and is highly polished (1 point), design appropriately addresses earlier critique but shows some lack of attention to detail (0.5 points), design fails to address earlier limitations (0 points)
  • Code (0.5 point): Correct and readable code (0.5 points), either incorrect or unreadable code (0 points).

Question

This exercise asks you to conduct an imaginary code review. These are often used in data science teams to,

  • Catch potential bugs
  • Make sure code is transparent to others
  • Create a shared knowledge base

It is important to be perceptive but friendly.

  • Can the code be made more compact?
  • Are there visual design choices / encodings that could be refined?
  • If your colleague did something well, say so!

They can also be a great way to learn new functions and programming patterns. Unlike standard code-reviews, I ask you to give an example implementing your recommendations.

Specifically, in this review, suppose you are working on a sports blog, and your colleague is soccer interested in which teams won the most games in a few European leagues over the last few years. They have written the code below. Provide your code review as a set of bullet points, and include code giving an example implementation of your ideas. The original data are from this link.

win_props <- read_csv("https://raw.githubusercontent.com/krisrs1128/stat479_s22/main/exercises/data/understat_per_game.csv") %>%
  group_by(team, year) %>%
  summarise(n_games = n(), wins = sum(wins) / n_games)

best_teams <- win_props %>%
  ungroup() %>%
  slice_max(wins, prop = 0.2) %>%
  pull(team)

win_props %>%
  filter(team %in% best_teams) %>%
  ggplot() +
  geom_point(aes(year, team, size = n_games, alpha = wins))
An example figure for code review.

An example figure for code review.

Example Solution

  • The code is already reasonably concise. One minor point is that we could directly calculate the fraction of wins by using mean, rather than computing the sum and dividing by the total,

    read_csv("https://raw.githubusercontent.com/krisrs1128/stat479_s22/main/exercises/data/understat_per_game.csv") %>%
      group_by(team, year) %>%
      summarise(wins = mean(wins))
  • Though the code is clear, the visual design needs substantial improvement. It is challenging to make comparisons between similar values using the size / transparency visual encoding. A positional (faceted bar) or color (heatmap) encoding would have been much easier to read. On a similar note, this figure is potentially trying to do too much: it would have been clearer to consider either the n_games or wins statistics, but not both at once.

  • It is a poor choice to sort the teams alphabetically. It would have been more meaningful to sort by some attribute of the teams.

  • There is substantial whitespace between years. The amount of actual data encoded in this large figure space is relatively small.

Here are three alternative designs suggested by this review. The first is the most similar to the original figure, but replaces the points with a heatmap of win fraction. It also sorts the teams from those with the highest average win fraction. It ignores the total number of games played. The second zooms into just the teams with the most wins, assuming that readers are most likely to be interested in the strongest teams. The final approach encodes the year as color, focusing attention on the win fraction.

ggplot(win_props) +
  geom_tile(aes(year, reorder(team, wins), fill = wins, col = wins)) +
  scale_x_continuous(expand = c(0, 0)) +
  scale_y_discrete(expand = c(0, 0)) +
  scale_fill_gradient2(midpoint = 0.5) +
  scale_color_gradient2(midpoint = 0.5)

library(ggrepel)
ggplot(win_props, aes(n_games, wins)) +
  geom_point() +
  geom_text_repel(aes(label = team), size = 4) +
  ylim(0.7, 0.85) +
  facet_wrap(~ year)

ggplot(win_props) +
  geom_point(aes(wins, reorder(team, wins), size = n_games, col = year)) +
  scale_size(range = c(0, 2)) +
  theme(axis.text = element_text(size = 8))

Visual Redesign

Scoring

  • a - b (1 point): Accurate and complete analysis of visualization’s goals, using concepts introduced in class (1 point), generally accurate, but potentially vague or poorly referenced, analysis (0.5 points), little evidence of specific analysis (0 points).
  • c (1 point): Critical and insightful analysis of past visualization’s limitations (1 point), generally correct analysis but failing to observe important limitations (0.5 points), imprecise or poorly elaborated analysis (0 points).
  • d, design and code (1.5 points): Substantive improvements in new design and elegant code (1.5 points), appropriate design and readable code (0.75 points), negligible changes in design or unreadable code (0 points).
  • d, discussion (1.5 points): Benefits of new design are discussed clearly and refer to concepts from class (1.5 points), benefits of design are discussed imprecisely (0.75 points), missing discussion (0 points).

Question

In this exercise, you will find a visualization you have made in the past and redesign it using the skills you have learned in this course.

  1. Identify one of your past visualizations for which you still have data. Include a screenshot of this past visualization.
  2. Comment on the main takeaways from the visualization and the graphical relationships that lead to that conclusion. Is this takeaway consistent with the intended message? Are there important comparisons that you would like to highlight, but which are harder to make in the current design?
  3. Comment on the legibility of the original visualization. Are there aspects of the visualization that are cluttered or difficult to read?
  4. Propose and implement an alternative design. What visual tasks do you prioritize in the new design? Did you have to make any trade-offs? Did you make any changes specifically to improve legibility.

Solutions

Solutions to this problem will vary. If you would like to discuss your specific visualization and redesign, please see the instructor. Two exceptionally good example solutions are included below.

Example Solution A

The solution below is due to Jonquil Liao.

  1. Identify one of your past visualizations for which you still have data. Include a screenshot of this past visualization.
chocolate <- read.csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-01-18/chocolate.csv')

Previously I wanted to check if different cocoa percentage lead to different ratings. So I did:

library(tidyr)
cocoa_percent = extract_numeric(chocolate$cocoa_percent)
boxplot(chocolate$rating~cocoa_percent)

  1. Comment on the main takeaways from the visualization and the graphical relationships that lead to that conclusion. Is this takeaway consistent with the intended message? Are there important comparisons that you would like to highlight, but which are harder to make in the current design?

Cocoa_percent is shown in x-axis, chocolate ratings are shown as y-axis. Previously I concluded the ratings are generally higher when cocoa_percent is between 60-80% (medium percent), because on the graph, the ‘boxes’ are ‘higher’ for medium cocoa percent while ‘lower’ when the percentage is over 80 or below 60. I think is good to make a rating vs. percent boxplot to compare them, it roughly gives us a sense of how the rating changes across cocoa percent. But we can’t tell the size of each group, boxplot only shows how the data points spread out but does not indicate which boxes have bigger samples.

  1. Comment on the legibility of the original visualization. Are there aspects of the visualization that are cluttered or difficult to read?

The x-axis does not show all the scales, but I think that is fine. However, the boxplot can not clearly show the distribution of ratings.

  1. Propose and implement an alternative design. What visual tasks do you prioritize in the new design? Did you have to make any trade-offs? Did you make any changes specifically to improve legibility.
chocolate %>%
  group_by(rating) %>%
  count(cocoa_percent) %>%
  ggplot() +
  geom_point(aes(extract_numeric(cocoa_percent),rating,size = n, col = rating)) +
  labs(x = 'cocoa percent', y = 'rating')

I changed boxplot into scatterplot with the size of the dot indicating the number of samples in that category. Because in this plot, I want to stress the problem that boxplot cannot show sample size and hence we don’t know if there is truly higher rating among medium cocoa percent or it is just because of lacking of data points in higher and lower cocoa percent. From the new graph, I figure we cannot conclude higher rating for 60-80% chocolate anymore, there are barely data points < 60 or > 85, a biased conclusion may be developed based on these data.

Example Solution B

The solution below is due to Margaret Turner.

  1. Identify one of your past visualizations for which you still have data. Include a screenshot of this past visualization.
# Access data
soil_add <- c("Additive Concentrations",
              "(1)      1.9  2.1",
              "(1)      2.4  2.8",
              "(1)      1.4  1.6",
              "(2)      2.0  1.8",
              "(2)      1.2  1.2",
              "(2)      1.9  1.6",
              "(3)      2.9  3.0",
              "(3)      3.7  3.2",
              "(3)      2.2  2.2",
              "(4)      5.1  4.5",
              "(4)      3.3  3.0",
              "(4)      3.0  3.5",
              "") %>% 
  stringr::str_split(., " ") %>% 
  .[2:13] %>% 
  unlist(.) %>% 
  .[. != ""] %>% 
  matrix(., ncol = 3, byrow = TRUE) %>% 
  data.frame() %>% 
  mutate(pot = rep(1:3, times = 4)) %>% 
  tidyr::pivot_longer(., cols = c(X2, X3)) %>% 
  select(-name) %>% 
  transmute(additive = factor(X1),
            concentration = as.numeric(value),
            pot = factor(pot))

soil_add %>% head() %>% print()
## # A tibble: 6 × 3
##   additive concentration pot  
##   <fct>            <dbl> <fct>
## 1 (1)                1.9 1    
## 2 (1)                2.1 1    
## 3 (1)                2.4 2    
## 4 (1)                2.8 2    
## 5 (1)                1.4 3    
## 6 (1)                1.6 3
soil_add %>% 
  ggplot(aes(x = additive, y = concentration)) +
  geom_boxplot() +
  geom_jitter(size = 3, alpha = 0.75, width = 0.05) +
  theme_minimal() +
  labs(x = "Treatment", y = "Complex molecule concentration (ppm)")

  1. Comment on the main takeaways from the visualization and the graphical relationships that lead to that conclusion. Is this takeaway consistent with the intended message? Are there important comparisons that you would like to highlight, but which are harder to make in the current design?

This visualization was prepared to examine the effects of four soil additive treatments on the yield of a complex molecule in corn roots.

One critical aspect of the experimental design is that two yields were measured from each plant, but this visualization fails to convey this. Additionally, the x-axis is not helpful. The labels of the four treatments provide no information about the treatments themselves.

From the current visualization, the viewer can determine that some treatments seem to affect the molecular yield, but the viewer has no context for what the treatments are and is missing crucial information about subsampling.

  1. Comment on the legibility of the original visualization. Are there aspects of the visualization that are cluttered or difficult to read?

Jittering was used since some of the samples within the same treatment had equal yields (and, therefore, equal x and y values in this visualization). However, jittering such a small number of points over a boxplot looks kind of sloppy, especially because the points are so large. I had increased the size of the points to help them stand out from the boxplots.

  1. Propose and implement an alternative design. What visual tasks do you prioritize in the new design? Did you have to make any trade-offs? Did you make any changes specifically to improve legibility.

I had reservations about encoding plant grouping using color or point shape. There are 12 plants overall; 3 plants for each treatment group. A 3-color (or 3-point-shape) scale for plant might be misleading, as it implies some nonexistent connection between “plant 1” in each of the treatment groups. However, a 12-value scale would have too many colors (or point shapes) to have good contrast within each treatment group.

Therefore, I created scatterplots faceted by the soil additive treatments, using an arbitrary x-axis to separate the subsamples from each plant. Violin plots replace the boxplots from the first visualization. The violin plots are light green to help the points stand out without having to make them awkwardly large. Each faceted plot has a short description of the soil additive treatment (rather than a context-less number).

It is still not clear from the visualization itself that the x-axis breaks up subsamples (a caption would be required to explain this).

set.seed(1416) # To standardize the jitter

additive.labs <- c("Standard", "New", "New + 1% P", "New + 2% P")
names(additive.labs) <- c("(1)", "(2)", "(3)", "(4)")

ggplot(soil_add, aes(y = concentration)) +
  geom_violin(aes(x = "2", y = concentration), fill = "#1BC51B", alpha = 0.4) +
  geom_jitter(aes(x = pot), height = 0, width = 0.15, size = 2, alpha = 0.8) +
  facet_grid(
    ~additive, switch = "x",
    labeller = labeller(additive = additive.labs)
    ) +
  theme_bw() +
  labs(title = "Effect of soil additives on complex molecule yield",
       x = "Soil additive blend", 
       y = "Complex molecule concentration (ppm)") +
  theme(
    plot.title = element_text(hjust = 0.5),
    plot.subtitle = element_text(hjust = 0.5),
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank()
    )  

Antibiotics Comparison

Scoring

  • Discussion (2.5 points): Complete and accurate (2.5 points), moderately developed and mostly accurate (1.25 point), insufficiently developed or broadly inaccurate (0 points)
  • Code (0.5 point): Correct and readable code (0.5 points), either incorrect or unreadable code (0 points).

Question

Below, we provide three approaches to visualizing species abundance over time in an antibiotics dataset.

antibiotic <- read_csv("https://uwmadison.box.com/shared/static/5jmd9pku62291ek20lioevsw1c588ahx.csv")
antibiotic
## # A tibble: 972 × 7
##    species  sample value ind    time svalue antibiotic     
##    <chr>    <chr>  <dbl> <chr> <dbl>  <dbl> <chr>          
##  1 Unc05qi6 D1         0 D         1   NA   Antibiotic-free
##  2 Unc05qi6 D2         0 D         2   NA   Antibiotic-free
##  3 Unc05qi6 D3         0 D         3    0   Antibiotic-free
##  4 Unc05qi6 D4         0 D         4    0   Antibiotic-free
##  5 Unc05qi6 D5         0 D         5    0   Antibiotic-free
##  6 Unc05qi6 D6         0 D         6    0.2 Antibiotic-free
##  7 Unc05qi6 D7         0 D         7    0.2 Antibiotic-free
##  8 Unc05qi6 D8         1 D         8    0.2 Antibiotic-free
##  9 Unc05qi6 D9         0 D         9    0.2 Antibiotic-free
## 10 Unc05qi6 D10        0 D        10    0.2 Antibiotic-free
## # … with 962 more rows

For each approach, describe,

  • One type of visual comparison for which the visualization is well-suited.

  • One type of visual comparison for which the visualization is poorly-suited.

Make sure to explain your reasoning.

Example Solution

antibiotic <- read_csv("https://uwmadison.box.com/shared/static/5jmd9pku62291ek20lioevsw1c588ahx.csv")
ggplot(antibiotic, aes(time)) +
  geom_line(aes(y = svalue), size = 1.2) +
  geom_point(aes(y = value, col = antibiotic), size = 0.8, alpha = 0.8) +
  scale_color_brewer(palette = "Set2") +
  facet_grid(species ~ ind) +
  guides(color = guide_legend(override.aes = list(size = 2, alpha = 1))) +
  theme(strip.text.y = element_text(angle = 0))

This figure is effective for,

  • Comparing abundances over time for each species and subject combination, even for rare species. It is easy to compare \(y\)-axis values within individual panels. Since the \(y\)-axis scales are not scaled, trends in even the rare species are visible.
  • Comparing species abundance across antibiotic treatment regimes. Since color is used to encode treatment regime, we can easily see how peaks or valleys coincide with the treatments.

This figure is ineffective for,

  • Comparing abundances for different species for the same subject. Since the \(y\)-axes scales are not shared, it is hard to compare abundances across species.
  • Ranking species by overall abundance within or across subjects. Again, this is a consequence of the unshared scales.
  • Comparing trends in species abundance across subjects (especially D vs. F). Since our eyes have to travel left and right to compare species trends, it is harder to evaluate differences across subjects, relative to if they were all overlapping, for example.
ggplot(antibiotic) +
  geom_tile(aes(time, ind, fill = value)) +
  scale_x_continuous(expand = c(0, 0)) +
  scale_fill_distiller(direction = 1) +
  facet_grid(species ~ .) +
  theme(strip.text.y = element_text(angle = 0))

This figure is effective for,

  • For individual species, comparing trends over time across subjects. All the subjects are placed adjacent to one another within each panel, so our eyes don’t have to travel such a large distance to make the comparison.
  • Across species, recognizing shared increases or decreases at specific timepoints. Since the plot is so compact, all the values for a single timepoint are easily queryable.
  • Recognizing the species and samples with the highest abundances. The cells with the darkest colors pop out from among the rest.

This figure is ineffective for,

  • Comparing the absolute abundances of a single species over time. It is difficult to compare shades of the same color.

  • Evaluating the abundance of relatively rare species. These species all have light colors, and gradations smaller than the color scale bin size are not visible.

  • Comparing species abundances for a single subject. We have to move our eyes across the three panels to make comparisons about a single species.

    ggplot(antibiotic) +
      geom_line(aes(time, svalue, col = species)) +
      facet_grid(ind ~ .) +
      scale_color_brewer(palette = "Set2") +
      scale_x_continuous(expand = c(0, 0)) +
      labs(x = "Time", y = "Value") +
      theme(legend.position = "Bottom")

This figure is effective for,

  • Within a single subject, ranking species by overall abundance. We can easily see which colors lie above the others within any given panel.
  • Comparing abundance over time for a single subject and species. We can see increases and decreases clearly when plotting against a y-axis scale.
  • Comparing overall species abundances across subjects. Since the same y-axis scale is used across panels, we can conclude that some subjects have more counts overall.

This figure is ineffective for,

  • Comparing trends for a single species across subjects. It is visually challenging to match colors across the three panels.
  • Comparing trends for low abundance species. For low abundances, many of the lines overlap with one another.

Name App Bugs

Scoring

  • For each part, 0.25 points are awarded for a correct answer and 0.5 points are awarded for a thorough explanation.

Question

The following versions of the Name app all have errors that will raise an error if you try running them in R. For each part, isolate the line(s) that contain the bug. Provide a conceptual explanation for why the error occurred and how they could be prevented more generally.

Example Solution

  1. This program fails because all inputs must be referred to within reactive or render contexts. It is also nonsensical to try changing an input within the server – only outputs can be changed.

    server <- function(input, output) {
      input$name <- paste0("Welcome to shiny, ", input$name, "!")
    ...
  2. This program fails because the input and output IDs do not match for printed_name / printed_names. Changing the input$name input would have no effect on the printed_name output, because there is no matching output in the server.

    ui <- fluidPage(
      ...
      textOutput("printed_name")
    )
    
    server <- function(input, output) {
      output$printed_names <- renderText({
      ...
  3. This program fails because the rendered type for the printed_name output does not match what the UI is expecting. The UI is expecting a text output in the form of a character string, but the output is being provided a data.frame (rendered through renderDataTable) and not a string. Either the original UI would need to be changed to a DataTableOutput or the server would need to be modified to a renderText call.

    ui <- fluidPage(
      ...
      textOutput("printed_name")
    )
    
    server <- function(input, output) {
      output$printed_name <- renderDataTable({
    ...
  4. This program fails because the UI elements are not separated by commas. Each UI element in a fluidPage needs to be separated by a comma in order to render properly – Shiny cannot parse newlines within a function call.

    ui <- fluidPage(
      titlePanel("Hello!")
      textInput("name", "Enter your name")
      textOutput("printed_name")
    )